Statistical Learning Tools for Heteroskedastic Data
Abstract
Many regression procedures are affected by heteroskedasticity, or non-constant variance. A classic solution is to transform the response y and model h(y) instead. Common functions h require a direct relationship between the variance and the mean. Unless the transformation is known in advance, it can be found by applying a model for the variance to the squared residuals from a regression fit. Unfortunately, this approach additionally requires the strong assumption that the regression model for the mean is ‘correct’, whereas many regression problems involve model uncertainty. Consequently, it is undesirable to assume that the mean model can be correctly specified at the outset. An alternative is to model the mean and variance simultaneously, trying different mean models and variance models together in different combinations and assessing the fit of each combination using a single criterion. We demonstrate this approach in three different problems: unreplicated factorials, regression trees, and random forests. For the unreplicated factorial problem, we develop a model for joint identification of mean and variance effects that can reliably identify active effects of both types. The joint model is estimated using maximum likelihood, and effect selection is done using a specially derived information criterion (IC). Our method is capable of identifying sensible location-dispersion models that are not considered by methods that rely on sequential estimation of location and dispersion effects. We take a similar approach to modeling variances in regression trees. We develop an alternative likelihood-based split-selection criterion that has the capacity to account for local variance in the regression in an unstructured manner, and the tree is built using a specially derived IC. Our IC explicitly accounts for the split-selection parameter and also leads to a faster pruning algorithm that does not require cross-validation.
We show that the new approach performs better for mean estimation under heteroskedasticity. Finally, we use these likelihood-based trees as base learners in an ensemble much like a random forest, and improve the random forest procedure itself. First, we show that typical random forests are inefficient at fitting flat mean functions. Our first improvement is the novel α-pruning algorithm, which adaptively changes the number of observations in the terminal nodes of the regression trees depending on the flatness. Second, we show that random forests are inefficient at estimating means when the data are heteroskedastic, which we address by using our likelihood-based regression trees as a base learner. This allows explicit variance estimation and improved mean estimation under heteroskedasticity. Our unifying and novel contribution to these three problems is the specially derived IC. Our approach is to simulate values of the IC for several models and to store these values in a lookup table. With the lookup table, models can be evaluated and compared without needing either cross-validation or a holdout set. We call this approach the Corrected Heteroskedastic Information Criterion (CHIC) paradigm and we demonstrate that applying the CHIC paradigm is a principled way to model variance in finite sample sizes.
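To make the idea of a likelihood-based split-selection criterion concrete, the following is a minimal sketch and not the criterion derived in this work. It assumes each candidate node is modeled as Gaussian with its own mean and variance, and it scores a split by the summed negative log-likelihood of the two child nodes at their maximum likelihood estimates. The function names `gaussian_nll` and `best_likelihood_split` and the `min_leaf` parameter are hypothetical, and the IC penalty of the CHIC paradigm is not reproduced here.

```python
import numpy as np


def gaussian_nll(y, eps=1e-12):
    """Negative log-likelihood of y under N(mean, var), evaluated at the MLEs."""
    y = np.asarray(y, dtype=float)
    n = y.size
    var = max(y.var(), eps)  # MLE of the variance (divides by n); floored to avoid log(0)
    return 0.5 * n * (np.log(2.0 * np.pi * var) + 1.0)


def best_likelihood_split(x, y, min_leaf=5):
    """Scan thresholds on one predictor and return (threshold, nll) of the best split.

    Assumption for illustration: each child node gets its own mean and variance,
    so a split can be rewarded for separating regions of different variance
    even when the means are equal. Returns (None, parent_nll) if no split helps.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x, kind="stable")
    x, y = x[order], y[order]

    best_threshold, best_nll = None, gaussian_nll(y)
    for i in range(min_leaf, y.size - min_leaf + 1):
        if x[i] == x[i - 1]:
            continue  # cannot place a threshold between tied x values
        nll = gaussian_nll(y[:i]) + gaussian_nll(y[i:])
        if nll < best_nll:
            best_threshold, best_nll = 0.5 * (x[i - 1] + x[i]), nll
    return best_threshold, best_nll


# Toy data: constant mean, but the spread jumps from 0.1 to 5 at x = 20.
x = np.arange(40)
signs = np.where(x % 2 == 0, 1.0, -1.0)
y = signs * np.where(x < 20, 0.1, 5.0)
threshold, nll = best_likelihood_split(x, y)
```

Because the two-node model nests the one-node model, the maximized likelihood never gets worse after a split, so this raw criterion would keep splitting. That is the role a penalty term such as an information criterion plays when deciding whether a split is worth keeping.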
Similar resources
The Effect of Applying Color and Light Training Materials on the Female First Grade Students’ Learning Outcome of Persian Language Lessons in Sharoud
Color images raise the educational level by providing more detailed information; therefore, these images are believed to be effective in gaining a deeper understanding of the lessons. Moreover, proper lighting enhances students’ learning and performance. This study aimed at assessing the impact of training color and light materials on elementary school girls’ attention and learning in Pe...
Using Importance and Satisfaction Constructs to Measure the Effectiveness of E-Learning Systems
This paper has employed a novel approach for determining effectiveness of electronic learning systems. This new approach employed importance and satisfaction structures for measurement of electronic learning systems’ effectiveness from viewpoint of their users. Hadith Science Virtual College in Rey was surveyed as case study. Two matrix analysis tools of “importance-satisfaction matrix” and “...
Cold-start Active Learning with Robust Ordinal Matrix Factorization
We present a new matrix factorization model for rating data and a corresponding active learning strategy to address the cold-start problem. Coldstart is one of the most challenging tasks for recommender systems: what to recommend with new users or items for which one has little or no data. An approach is to use active learning to collect the most useful initial ratings. However, the performance...
Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap
This paper provides Bayesian rationalizations for White’s heteroskedastic consistent (HC) covariance estimator and various modifications of it. An informed Bayesian bootstrap provides the statistical framework.
Provide a causal model of higher education performance in the context of the Corona Crisis based on social responsibility and the quality of e-learning
The present study is applied in terms of purpose and, in terms of data collection, is descriptive research of the correlational type. The statistical population of the study is 430 teachers of Apadana Higher Education Institute in 2020, of whom 203 were selected by simple random sampling based on Cochran's formula. Data collection tools are Carroll Social Re...
Influence diagnostic analysis in the possibly heteroskedastic linear model with exact restrictions
The local influence method has proven to be a useful and powerful tool for detecting influential observations on the estimation of model parameters. This method has been widely applied in different studies related to econometric and statistical modelling. We propose a methodology based on the Lagrange multiplier method with a linear penalty function to assess local influence in the possibly het...
Publication date: 2016